Similar resources
Graph-Parallel Entity Resolution using LSH & IMM
In this paper we describe graph-based parallel algorithms for entity resolution that improve over the map-reduce approach. We compare two approaches to parallelize a Locality Sensitive Hashing (LSH) accelerated, Iterative Match-Merge (IMM) entity resolution technique: BCP, where records hashed together are compared at a single node/reducer, vs an alternative mechanism (RCP) where comparison loa...
Dedoop: Efficient Deduplication with Hadoop
We demonstrate a powerful and easy-to-use tool called Dedoop (Deduplication with Hadoop) for MapReduce-based entity resolution (ER) of large datasets. Dedoop supports a browser-based specification of complex ER workflows including blocking and matching steps as well as the optional use of machine learning for the automatic generation of match classifiers. Specified workflows are automatically t...
P-Swoosh: Parallel Algorithm for Generic Entity Resolution
Entity Resolution (ER) is a problem that arises in many information integration applications. The ER process identifies duplicate records that refer to the same real-world entity (match process), and derives composite information about the entity (merge process). Additionally, a merged record can recursively match other records. Since the ER process is typically compute-intensive, it is import...
Scalable Entity Resolution Using Probabilistic Signatures on Parallel Databases
Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel Entity Resolution algorithm that introduces a data-driven blocking and record linkage te...
Entity Resolution with Evolving Rules
Entity resolution (ER) identifies database records that refer to the same real world entity. In practice, ER is not a one-time process, but is constantly improved as the data, schema and application are better understood. We address the problem of keeping the ER result up-to-date when the ER logic “evolves” frequently. A naïve approach that re-runs ER from scratch may not be tolerable for resol...
Journal
Journal title: Datenbank-Spektrum
Year: 2012
ISSN: 1618-2162, 1610-1995
DOI: 10.1007/s13222-012-0110-x